{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# `cluster_studio_dashboard`\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "tags": [
          "hide_input"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "        <iframe\n",
              "            width=\"100%\"\n",
              "            height=\"1000\"\n",
              "            src=\"./img/cluster_studio.html\"\n",
              "            frameborder=\"0\"\n",
              "            allowfullscreen\n",
              "            \n",
              "        ></iframe>\n",
              "        "
            ],
            "text/plain": [
              "<IPython.lib.display.IFrame at 0x128914dc0>"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from IPython.display import IFrame\n",
        "IFrame(src=\"./img/cluster_studio.html\", width=\"100%\", height=1000)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "!!! info \"At a glance\"\n",
        "\n",
        "    **API Documentation:** [cluster_studio_dashboard()](../api_docs/visualisations.md#splink.internals.linker_components.visualisations.LinkerVisualisations.cluster_studio_dashboard)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Worked Example"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n",
            "----- Estimating u probabilities using random sampling -----\n",
            "u probability not trained for dob - Abs difference of 'transformed dob <= 1 month' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n",
            "\n",
            "Estimated u probabilities using random sampling\n",
            "\n",
            "Your model is not yet fully trained. Missing estimates for:\n",
            "    - first_name (no m values are trained).\n",
            "    - surname (no m values are trained).\n",
            "    - dob (some u values are not trained, no m values are trained).\n",
            "    - city (no m values are trained).\n",
            "    - email (no m values are trained).\n",
            "\n",
            "----- Starting EM training session -----\n",
            "\n",
            "Estimating the m probabilities of the model by blocking on:\n",
            "(l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\")\n",
            "\n",
            "Parameter estimates will be made for the following comparison(s):\n",
            "    - dob\n",
            "    - city\n",
            "    - email\n",
            "\n",
            "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n",
            "    - first_name\n",
            "    - surname\n",
            "\n",
            "WARNING:\n",
            "Level Abs difference of 'transformed dob <= 1 month' on comparison dob not observed in dataset, unable to train m value\n",
            "\n",
            "WARNING:\n",
            "Level Jaro-Winkler distance of transformed email >= 0.88 on comparison email not observed in dataset, unable to train m value\n",
            "\n",
            "Iteration 1: Largest change in params was -0.466 in the m_probability of dob, level `Exact match on dob`\n",
            "Iteration 2: Largest change in params was 0.141 in probability_two_random_records_match\n",
            "Iteration 3: Largest change in params was 0.0319 in probability_two_random_records_match\n",
            "Iteration 4: Largest change in params was 0.0105 in probability_two_random_records_match\n",
            "Iteration 5: Largest change in params was 0.00435 in probability_two_random_records_match\n",
            "Iteration 6: Largest change in params was 0.00208 in probability_two_random_records_match\n",
            "Iteration 7: Largest change in params was 0.00109 in probability_two_random_records_match\n",
            "Iteration 8: Largest change in params was 0.000601 in probability_two_random_records_match\n",
            "Iteration 9: Largest change in params was 0.000342 in probability_two_random_records_match\n",
            "Iteration 10: Largest change in params was 0.000197 in probability_two_random_records_match\n",
            "Iteration 11: Largest change in params was 0.000115 in probability_two_random_records_match\n",
            "Iteration 12: Largest change in params was 6.75e-05 in probability_two_random_records_match\n",
            "\n",
            "EM converged after 12 iterations\n",
            "m probability not trained for dob - Abs difference of 'transformed dob <= 1 month' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n",
            "m probability not trained for email - Jaro-Winkler distance of transformed email >= 0.88 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n",
            "\n",
            "Your model is not yet fully trained. Missing estimates for:\n",
            "    - first_name (no m values are trained).\n",
            "    - surname (no m values are trained).\n",
            "    - dob (some u values are not trained, some m values are not trained).\n",
            "    - email (some m values are not trained).\n",
            "\n",
            "----- Starting EM training session -----\n",
            "\n",
            "Estimating the m probabilities of the model by blocking on:\n",
            "l.\"dob\" = r.\"dob\"\n",
            "\n",
            "Parameter estimates will be made for the following comparison(s):\n",
            "    - first_name\n",
            "    - surname\n",
            "    - city\n",
            "    - email\n",
            "\n",
            "Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n",
            "    - dob\n",
            "\n",
            "WARNING:\n",
            "Level Jaro-Winkler distance of transformed email >= 0.88 on comparison email not observed in dataset, unable to train m value\n",
            "\n",
            "Iteration 1: Largest change in params was 0.64 in probability_two_random_records_match\n",
            "Iteration 2: Largest change in params was 0.176 in probability_two_random_records_match\n",
            "Iteration 3: Largest change in params was 0.0846 in the m_probability of first_name, level `All other comparisons`\n",
            "Iteration 4: Largest change in params was 0.0268 in probability_two_random_records_match\n",
            "Iteration 5: Largest change in params was 0.0101 in probability_two_random_records_match\n",
            "Iteration 6: Largest change in params was 0.00431 in probability_two_random_records_match\n",
            "Iteration 7: Largest change in params was 0.00198 in probability_two_random_records_match\n",
            "Iteration 8: Largest change in params was 0.000936 in probability_two_random_records_match\n",
            "Iteration 9: Largest change in params was 0.00045 in probability_two_random_records_match\n",
            "Iteration 10: Largest change in params was 0.000218 in probability_two_random_records_match\n",
            "Iteration 11: Largest change in params was 0.000106 in probability_two_random_records_match\n",
            "Iteration 12: Largest change in params was 5.19e-05 in probability_two_random_records_match\n",
            "\n",
            "EM converged after 12 iterations\n",
            "m probability not trained for email - Jaro-Winkler distance of transformed email >= 0.88 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n",
            "\n",
            "Your model is not yet fully trained. Missing estimates for:\n",
            "    - dob (some u values are not trained, some m values are not trained).\n",
            "    - email (some m values are not trained).\n",
            "\n",
            " -- WARNING --\n",
            "You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\n",
            "Comparison: 'dob':\n",
            "    m values not fully trained\n",
            "Comparison: 'dob':\n",
            "    u values not fully trained\n",
            "Comparison: 'email':\n",
            "    m values not fully trained\n",
            "The 'probability_two_random_records_match' setting has been set to the default value (0.0001). \n",
            "If this is not the desired behaviour, either: \n",
            " - assign a value for `probability_two_random_records_match` in your settings dictionary, or \n",
            " - estimate with the `linker.training.estimate_probability_two_random_records_match` function.\n",
            "Completed iteration 1, root rows count 3\n",
            "Completed iteration 2, root rows count 0\n"
          ]
        },
        {
          "data": {
            "text/html": [
              "\n",
              "        <iframe\n",
              "            width=\"100%\"\n",
              "            height=\"1200\"\n",
              "            src=\"./img/cluster_studio.html\"\n",
              "            frameborder=\"0\"\n",
              "            allowfullscreen\n",
              "            \n",
              "        ></iframe>\n",
              "        "
            ],
            "text/plain": [
              "<IPython.lib.display.IFrame at 0x128701000>"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import splink.comparison_library as cl\n",
        "from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n",
        "\n",
        "df = splink_datasets.fake_1000\n",
        "\n",
        "settings = SettingsCreator(\n",
        "    link_type=\"dedupe_only\",\n",
        "    comparisons=[\n",
        "        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n",
        "        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n",
        "        cl.DateOfBirthComparison(\n",
        "            \"dob\",\n",
        "            input_is_string=True,\n",
        "            datetime_metrics=[\"year\", \"month\"],\n",
        "            datetime_thresholds=[1, 1],\n",
        "        ),\n",
        "        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n",
        "        cl.EmailComparison(\"email\"),\n",
        "    ],\n",
        "    blocking_rules_to_generate_predictions=[\n",
        "        block_on(\"substr(first_name,1,1)\"),\n",
        "        block_on(\"substr(surname, 1,1)\"),\n",
        "    ],\n",
        "    retain_intermediate_calculation_columns=True,\n",
        "    retain_matching_columns=True,\n",
        ")\n",
        "\n",
        "linker = Linker(df, settings, DuckDBAPI())\n",
        "linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n",
        "\n",
        "blocking_rule_for_training = block_on(\"first_name\", \"surname\")\n",
        "\n",
        "linker.training.estimate_parameters_using_expectation_maximisation(\n",
        "    blocking_rule_for_training\n",
        ")\n",
        "\n",
        "blocking_rule_for_training = block_on(\"dob\")\n",
        "linker.training.estimate_parameters_using_expectation_maximisation(\n",
        "    blocking_rule_for_training\n",
        ")\n",
        "\n",
        "df_predictions = linker.inference.predict(threshold_match_probability=0.2)\n",
        "df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n",
        "    df_predictions, threshold_match_probability=0.5\n",
        ")\n",
        "\n",
        "linker.visualisations.cluster_studio_dashboard(\n",
        "    df_predictions, df_clusters, \"img/cluster_studio.html\",\n",
        "    sampling_method=\"by_cluster_size\", overwrite=True\n",
        ")\n",
        "\n",
        "# You can view the scv.html file in your browser, or inline in a notebook as follows\n",
        "from IPython.display import IFrame\n",
        "IFrame(src=\"./img/cluster_studio.html\", width=\"100%\", height=1200)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### What the chart shows\n",
        "\n",
        "See [here](https://youtu.be/msz3T741KQI?si=1VCK48bwENFcUyQS&t=2741) for a video explanation of the chart."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "base",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.8"
    },
    "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
